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Abstract. Wc consider a convex relaxation of sparse principal component analysis proposed by d'Aspremont et al. in [9]. 
This convex relaxation is a nonsmooth scmidefinite programming problem in which the £i norm of the desired matrix is imposed 
in either the objective function or the constraint to improve the sparsity of the resulting matrix. The sparse principal component 
is obtained by a rank-one decomposition of the resulting sparse matrix. Wc propose an alternating direction method based on 
a variable-splitting technique and an augmented Lagrangian framework for solving this nonsmooth semidefinite programming 
problem. In contrast to the first-order method proposed in [9] that solves approximately the dual problem of the original 
semidefinite programming problem, our method deals with the primal problem directly and solves it exactly, which guarantees 
that the resulting matrix is a sparse matrix. Global convergence result is established for the proposed method. Numerical 
results on both synthetic problems and the real applications from classification of text data and senate voting data are reported 
to demonstrate the efficacy of our method. 
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1. Introduction. Principal component analysis (PCA) plays an important role in applications arising 
from data analysis, dimension reduction and bioinformatics etc. PCA finds a few linear combinations of the 
original variables. These linear combinations, which are called principal components (PC), are orthogonal 
to each other and explain most of the variance of the data. Specifically, for a given data matrix M G 
which consists of n samples of the p variables, PCA corresponds to a singular value decomposition (SVD) 
of M or an eigenvalue decomposition of the sample covariance matrix S ~ MM^ G R''^''. Thus, for a given 
sample covariance matrix E, PCA is usually formulated as an eigenvalue problem: 

(1.1) X* :=argmax x^TjX^ s.t. ||-'J;|j2 < 1, 

where ||a;||2 is the Euchdean norm of vector x. Problem (|l.ip gives the eigenvector that corresponds to the 
largest eigenvalue of E. However, the loading vector x* is not expected to have many zero coefficients. This 
makes it hard to explain the PCs. For example, in the text classification problem, we are given a binary 
data matrix M G W^" that records the occurrences of p words in n postings. That is, Mij = 1 if the 
i-ih word appears in the j-th posting and il/y = if the i-th word does not appear in the j-th posting. 
The standard PCA cannot tell which words contribute most to the explained variance since the loadings are 
linear combinations of all the variables. Thus, sparse PCs are needed because it is easier to analyze which 
variables contribute most to the explained variance. 

Many techniques were proposed to extract sparse PCs from given sample covariance matrix E or sample 
data matrix M. One natural thought is to impose a cardinality constraint to (|l.ip . which leads to the 
following formulation for sparse PCA: 

(1.2) a;*:=:argmax x^Ex, s.t. ||a;||2 < 1, ||a;||o ^ -f^^; 
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where ||.i;[|o (the io norm of x) counts the number of nonzeros of x and the integer K controls the sparsity 
of the solution. Since the cardinality constraint ||a;||o < K makes the problem numerically intractable, many 
different models were proposed in the literature to overcome this difficulty. 

In [3], d'Aspremont et al. proposed to approximately solve (|1.2p by its convex relaxation, which is a 
nonsmooth semidefinite programming (SDP) problem. This is the first work that attempts to approximately 
solve (|1.2p by a convex problem. The SDP formulation is based on the lifting and projection technique, 
which is a standard technique in using SDP to approximate combinatorial problems (see e.g., [U |3l 134)). 
Note that if we denote X = xx^ , then (|1.2|) can be rewritten as 

(1.3) max s.t. Tr{X) = 1, ||X||o <K'^,X)^ 0,rank(X) = 1}, 

where Tr(X) denotes the trace of matrix X. The rank constraint is then dropped and the cardinality 
constraint is replaced by €i norm constraint, and this leads to following convex problem, which is an SDP. 

(1.4) max s.t. TriX) = 1, ||X||i <K,X)^ 0}, 

XGRPXP 

where the li norm of X is defined as ||X||i := \Xij \ and using the convex constraint \\X\\i < K to impose 
the sparsity of the solution is inspired by the recent emergence of compressed sensing (see e.g., [SlITO]). Note 
that II X 111 < K is used in (|1.4p instead of ||X||i < . This is due to the fact that, when X = xx^ and 
Tr{X) = 1, we have ||X||f = 1, and also that if ||X||o < then ||X||i < K\\X\\f. After the optimal 
solution X* to (|1.4p is obtained, the vector x from the rank-one decomposition of X*, i.e., X* = xx^ is 
used as an approximation of the solution of (|1.2p . This is the whole procedure of the lifting and projection 
technique. Although some standard methods such as interior point methods can be used to solve the SDP 
(jl.4[) (see e.g., [IIISIIM]); it is not wise to do so because (|1.4p is a nonsmooth problem, and transferring it 
to a standard SDP increases the size of the problem dramatically. 

It is known that (jl.4p is equivalent to the following problem with an appropriately chosen parameter 
p>0: 

(1.5) max {{^,X) - p\\X\\i s.t. Tr(X) = 1,X hO}- 
Note that (|1.5p can be rewritten as 

(1.6) max min (S + J/,^), 

X^O,Tr(X) = l ||C/||oo<P 

where ||?7||oo denotes the largest component of U in magnitude, i.e., ||t/||oo — maxy \Uij\. The dual problem 
of (jl.Sp is given by interchanging the max and min in (jl.6p . i.e., 

min max CE + U^X), 

\\U\\^<p Xyo,Tr{X) = l 

which can be further reduced to 



(1.7) min A,„ax(S + ?/), s.t. \\U\\oo < P, 

where Xmax{Z) denotes the largest eigenvalue of matrix Z. d'Aspremont et al. [9] proposed to solve the dual 
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problem (|1.7p using Nesterov's first-order algoritlim (see e.g., [371[2B|)i which is an accelerated projected 
gradient method. However, since the objective function of ()1.7|) is nonsmooth, one needs to smooth it in 
order to apply Nesterov's algorithm. Thus, the authors of [3] actually solve an approximation of the dual 
problem (|1.7p . which can be formulated as follows. 



(1.8) min f^{U), s.t. ||C/||oo < P, 

where /i > is the smoothing parameter, f^{U) := max{(S + U,X) — iid{X), s.t. Tr{X) — 1^X>Q} and 
d{X) Tr(XlogX) + log(n). It is shown in [9] that an approximate solution X^ to the primal problem 
(|1.5|) can be obtained by X^ = V/^ ([/'""), where U'^ is an approximate solution of (|1.8p . It is easy to see that 
X'' is not guaranteed to be a sparse matrix. Besides, although there is no duality gap between (|1.5p and 
(|1.7p . the authors solve an approximation of (|1.7p . It needs also to be noted that Nesterov's algorithm used 
in [5] cannot solve the constrained problem (|1.4p . Although (|1.4p and (|1.5p are equivalent with appropriately 
chosen parameters if and p, in many applications, it is usually easier to choose an appropriate K since we 
know how many nonzeros are preferred in the sparse PCs. 

Nonconvex reformulations of (|1.2p include the followsings. Zou et al. [121 considered a regression type 
formulation of (|1.2p with Lasso and elastic net regularizations. d'Aspremont et al. [5] considered a penalty 
version of (11.21). 



(1.9) (f){p) = max I Ex — p||a:||o. 

Ikll2<l 

d'Aspremont [8] showed that (|1.9p is equivalent to the following problem that maximizes a convex function 
over spherical constraint: 



(1.10) ^(p)- max £((a7x)2-p) 

k 2=1 ^— : 



where (a)+ := max{a, 0}, S = and a,; is the i-th column oi A € MP^p. Clearly, (|1.10p is a non-convex 

problem. d'Aspremont et al. thus proposed in [8] to solve (|1.10p by a greedy method. Journee et al. [22] 
considered the same formulation (jl.lOp and proposed a gradient type method, which is actually a generalized 
power method, to solve it. 



Note that (|1.2p only gives the largest sparse PC. In many applications, several leading sparse PCs are 
needed in order to explain more variance. Multiple sparse PCs are usually found by solving a sequence of 
sparse PCA problems (|1.2p . with E constructed via the so-called deflation technique for each sparse PC. Lu 
and Zhang [24] proposed the following model to compute the leading r sparse PCs of E simultaneously: 

max Tr(V^j:V) - p\\V\\i 

(1-11) s.t. ly^s^^.j < A,„Vz^j, 

V^V = I, 

where each column of V corresponds to a loading vector of the sample covariance matrix E and Ay > 0{i ^ j) 
are the parameters that control the correlation of the PCs. Lu and Zhang [2j proposed an augmented 
Lagrangian method to solve (jl.lip . Note that for these nonconvex formulations, algorithms proposed in the 
literature usually have only local convergence and global convergence is not guaranteed. 
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In this paper, we propose an alternating direction method based on a variable-splitting technique and 
an augmented Lagrangian framework for solving directly the primal problems (|1.4|) and ()1.5p . Our method 
solves two subproblems in each iteration. One subproblem has a closed-form solution that corresponds to 
projecting a given matrix onto the simplex of the cone of semidefinite matrices. This projection requires an 
eigenvalue decomposition. The other subproblem has a closed-form solution that corresponds to a vector 
shrinkage operation (for Problem (|1.5p ) or a projection onto the l\ ball (for Problem (|1.4[) '). Thus, our 
method produces two iterative points at each iteration. One iterative point is a semidefinite matrix with 
trace equal to one and the other one is a sparse matrix. Eventually these two points will converge to the same 
point, and thus we get an optimal solution which is a sparse and semidefinite matrix. Compared with the 
Ncsterov's first-order method suggested in [9] for solving the approximated dual problem ()1.8|) . our method 
can solve the nonsmooth primal problems (|1.4p and (|1.5p uniformly. Also, since we deal with the primal 
problems directly, the l\ norm in the constraint or the objective function guarantees that our solution is a 
sparse matrix, while Ncsterov's method in [21 does not guarantee this since it solves the approximated dual 
problem. 

The rest of the paper is organized as follows. In Section[21 we introduce our alternating direction method 
of multipliers for solving the nonsmooth SDP problems (|1.4p and (|1.5p . The global convergence results of the 
alternating direction method of multipliers are given in Section [31 We discuss some practical issues including 
the deflation technique for computing multiple sparse PCs in Section |4l In Section [Sj we use our alternating 
direction method of multipliers to solve sparse PCA problems arising from different applications such as 
classification of text data and senate voting records to demonstrate the efficacy of our method. We make 
some conclusions in Section |6l 

2. Alternating Direction Method of Multipliers. We first introduce some notation. We use C to 
denote the simplex of the cone of the semidefinite matrices, i.e., C = {X £ Rp^p \ Tr(X) = 1, AT >r 0}. We 
use B to denote the ^i-bah with radius K in i.e., B = {X e Wp \ \\X\\i < K}. Ia{X) denotes the 

indicator function of set A, i.e.. 



(2.1) Ia{X) 



if A e ^, 
+00 otherwise. 



We know that Ic{X) and /^(A) are both convex functions since C and B are both convex sets. We then can 
reformulate (|1.4p and (jl.Sp uniformly as the following unconstrained problem: 

(2.2) min ^{Y., X) + Ic{X) + h{X), 

where h{X) = /e(A) for dTll) and h{X) = p\\X\\i for ([LS]). Note that h{X) is convex in both cases, (g^ 
can be also viewed as the following inclusion problem: 



(2.3) Find A, s.t. £ -E + dIc{X) + dh{X). 

Problem p.3p finds zero of the sum of two monotone operators. Methods based on operator-splitting 
techniques, such as Douglas-Rachford method and Peachman-Rachford method, are usually used to solve 
Problem (|2.3p (see e.g., [TTl [30l [23l [131 [H E [T])- From the convex optimization perspective, the alter- 
nating direction method of multipliers (ADMM) for solving (|2.2p is a direct application of the Douglas- 
Rachford method. ADMM has been successfully used to solve structured convex optimization problems 
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arising from image processing, compressed sensing, machine learning, semidefinite programming etc. (see 
e.g., [llllIIlEelEHlIIllllllliainillllEelE^ show how ADMM can be used to solve the sparse 

PCA problem (j^ . 



ADMM is based on a variable-splitting technique and an augmented Lagrangian framework. By intro- 
ducing a new variable Y , (j2.2p can be rewritten as 



min -{j:,X)+Ic{X) + h{Y) 
s.t. X = Y. 

Note that although the number of variables is increased, the two nonsmooth functions Ic{-) and h{-) arc now 
separated since they are associated with different variables. For this equality-constrained problem, augmented 
Lagrangian method is a standard approach to solve it. A typical iteration of augmented Lagrangian method 
for solving (|2.4p is given by: 

(2.5) I ™ ' ' ^ 

^k+i _ A'^ - i(x'=+i -r'^+i), 

where the augmented Lagrangian function C,_i{X^ Y] A) is defined as: 

(2.6) C^{X, Y- A) -(S, X) + Ic{X) + h{Y) -{K,X~Y) + - Y\\l, 

where /i > is a penalty parameter and A is the Lagrange multiplier associated with the linear constraint 
X = Y. Note that it is usually hard to minimize the augmented Lagrangian function L^{X,Y;A'^) with 
respect to X and Y simultaneously. In fact, it is as difficult as solving the original problem (|2.4[) . However, 
if we minimize the augmented Lagrangian function with respect to X and Y alternatingly, we obtain two 
subproblems in each iteration and both of them are relatively easy to solve. This results in the following 
alternating direction method of multipliers. 




= argminx£^(X,r^A'=) 
= argminy/:^(A:'=+i,y;A'=) 

= A'= - (x*^+i -y'^+i)//i, 



(2.7) 



It can be shown that the two subproblems in (|2.7[) arc both relatively easy to solve in the sparse PCA 
problem. Before wc do that, we characterize two nice properties of the indicator function (j2.ip . 



• Property 1. The proximal mapping of the indicator function Ia{.') is the Euclidean projection onto 
A, i.e., 

(2.8) wo^i^{X) = rA{X), 
where 

(2.9) prox,^(^) := argmm{/^(C/) + \\\U - Xfp}, 
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and 

(2.10) VAiX) := aigmm{^\\U - X\\l, s.t. U e A}. 
• Property 2. The optimality conditions for Problem (|2.10p are given by 

(2.11) X - U* e dlA{U*), 
which is equivalent to 

(2.12) {X ^u*,z -u*) <o,yz e A, 

where U* is the optimal solution of (|2.10p . 
Now, the first subproblem in (|2.7p can be reduced to: 



(2.13) X^+i := argmin|/i/c(X) + - (Y" + fiA" + , 
which can be further reduced to projection onto C using Property 1, 

(2.14) X'^+i =Pc(y''+M''+A«S) :=argmin|i||X-(y^+/iA''+y"S)|l2,, s.t. X G c| 
When h{Y) = Ib{Y) as in Problem (|1.4p . the second subproblem in (|2.7p can be reduced to: 

(2.15) y'^+i := argmin Lib{Y) + l\\Y- {X'^+' - M')IIf| , 



2 

which can be further reduced to projection onto B using Property 1, 

(2.16) F'^+i = Pb{X^+^ - M'') := argmin - (X^+i - fiA'')\\l, s.t. Y eB 
When h{Y) = p\\Y\\i as in Problem (|1.5p . the second subproblem in (|2.7p can be reduced to: 

(2.17) r'^+i := argmin + ^WY - {X^+^ - M')IIf 
Problem (|2.17p has a closed-form solution that is given by 

(2.18) y'^+i = Shrink(X''^+i - M^ 
where the shrinkage operator is defined as: 

(2.19) (Shrink(Z, r)),, := sgn(Zy) max{|Z,y | - r, 0}, Vi, j. 

In the following, we will show that (|2.13l) and (|2.15p are easy to solve, i.e., the two projections (|2.14p 
and (|2.16l) can be done efficiently. First, since the problem of projection onto C 

(2.20) Vc{X) = a.YgimxY{]^\\Z - X\\l, s.t. Tr(Z) = l,Z^O} 



is unitary-invariant, its solution is given by Vc{X) = U diag(7)f7^, where X = U d\a,g{a)U^ is the eigenvalue 
decomposition of X , and 7 is the projection of a onto the simplex in the Euclidean space, i.e.. 



1 

(2.21) 7:=argmin{-||e-(T||2, s.t. ^e4-l,e>0}. 

4=1 

We consider a slightly more general problem 

1 ^ 

(2.22) r :=argmin{-||e-a||2, s.t. Y.^^^t,^>Q}, 

1=1 

where scalar r > 0. Note that p.2ip is a special case of (|2.22p with r = 1. From the first-order optimality 
conditions for (|2.22p . it is easy to show that the optimal solution of (|2.22[) is given by 

£* := max{(Ti - 0, 0}, Vi = 1, . . . ,p, 

where the scalar 9 is the solution of the following piecewise linear equation: 

p 

(2.23) ^ max{crj - 6*, 0} = r. 

It is known that the piecewise linear equation (|2.23|) can be solved quite efficiently and thus solving (j2.22p 
can be done easily. In fact, the following procedure (Algorithm [T]) gives the optimal solution of (|2.22l) . We 
refer the readers to [31] for the proof of the validity of the algorithm. It is easy to see that Algorithm [T] has 

Algorithm 1: Projection onto the simplex in the Euclidean space 
Input: A vector ct G and a scalar r > 0. 
Sort a into as a non-decreasing order: cti < (72 < • • • < o'p 

/ P \ 

Find index j, the smallest j such that aj — ^J^jj^i | ^ J' > 

Compute 9 = a,-r 

Output: A vector 7, s.t. 7^ = maxjcTi — 9, 0}, i = 1, . . . ,p. 



an 0{p\ogp) complexity. Linear time algorithms for solving (j2.22|) are studied in [4l[29l[T2]. Thus, solving 
(|2.13p corresponds to an eigenvalue decomposition and a projection onto the simplex in the Euclidean space, 
and they both can be done efficiently. 

Solving p.isp (or equivalently ()2.16|) ) corresponds to a projection onto the ^i-ball: ||yj|i < K. It has 
been shown in [12l |35] that projection onto the £i-ball can be done easily. In fact, the solution of 

(2.24) 7 = argmin{i||e - cr\\l s.t. ||e||i < r} 

is given by 7^ = sgn((Ti)7i, Vi = 1, . . . ,p, where 7 is the solution of 

1 ^ 
min s.t. ^ 7^ = r, 7 > 0, 

4 = 1 
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i.e., the projection of \(t\ (elementwise absolute value of a) onto the simplex. Thus, (|2.15p can be rewritten 
as 

(2.25) vec(y'=+i) = argmin{iliy - vec(X^-+i - M')ll2, s.t. ||yl|i < if}, 

and it corresponds to a projection onto the simplex in the Euclidean space, where vec(y) denotes the vector 
form of Y which is obtained by stacking the columns of Y into a long vector. 

To summarize, our ADMM for solving (|1.4p and (|1.5|) can be uniformly described as Algorithm [5J 

Algorithm 2: ADMM for solving ([Li)) and pT5|) 

Initialization: y° = 0, A° = 0. 
for k=0,l,. . . do 

Compute the eigenvalue decomposition: Y^ + /iA'^ + /iS = J7 diag(cr)[/^ 

Project (T onto the simplex in Euclidean space by Algorithm [TJ and denote the solution by 7 

Compute X^+^ = C/diag(7)C/^ 

Perform one of the followings: 

• if ([L4| is solved, update F'^+i by solving \2.2^ 

• if (US]) is solved, update Y^+^ by (f2T8| 
_ Update A'=+i by A'^+i = A*= - {X^+^ - Y^+^)/^l 



Remark 2.1. Although Algorithm\^ suggests that we need to compute the eigenvalue decomposition of 
Y^ + liK^ + ^iYj in order to get the solution to p.l3p . we actually only need to compute the positive eigenvalues 
and corresponding eigenvectors of Y^ + jiK^ + /.tS . 

3. Global Convergence Results. In this section, we prove that the sequence {X^ ,Y^ ,h.^) produced 
by the alternating direction method of multipliers (|2.7p (i.e.. Algorithm [2]) converges to A*), where 

{X* ,Y*) is an optimal solution to (|2.4p and A* is the corresponding optimal dual variable. Although the 
proof of global convergence results of ADMM has been studied extensively in the literature (see e.g., [HI [20]), 
we here give a very simple proof of the convergence of our ADMM that utilizes the special structures of the 
sparse PCA problem. We only prove the case when h{Y) ~ Ib{Y) and leave the case when h{Y) = 
to the readers since their proofs are almost identical. 

Before we give the main theorem about the global convergence of (|2.7p (Algorithm [2]) , we need the 
following lemma. 

Lemma 3.1. Assume that {X* ^Y*) is an optimal solution of \2A\ and A* is the corresponding optimal 
dual variable associated with the equality constraint X = Y. Then the sequence (X*^, y'^, A'^) produced by 
([2J)) satisfies 



(3.1) lit/'-' - u*\\l - \\U^'+^ - u*\\l > wu'' - t/^+Mlc. 




where U* ^ ["^^], ( ^^J and G = , and the norm || • ||^ is defined as ||C/||^ (C/, GU) 

and the corresponding inner product (•, ■)q is defined as {U, V)g ~ {U, GV). 

Proof Since {X* ,Y* , A*) is optimal to (|2.4p . it follows from the KKT conditions that the followings 
hold: 



(3.2) 



e - j: + dic{x*) - A* , 
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(3.3) OGa/e(F*) + A*, 
and 

(3.4) x*=Y*eCnB. 

By using Property 2, p.2|) and p.3|) can be respectively reduced to: 

(3.5) (I] + A*,X-X*) <0,VXeC, 
and 

(3.6) (-A*,r-r*) <o,vr ei3. 

Note that the optimahty conditions for the first subproblem (i.e., the subproblem with respect to X) in 
(P?7|) arc given by X'^+i e C and 

(3.7) G -s + a/c(x'=+i) - A'^ + -(x'^+i - y*^'). 

By using Property 2 and the updating formula for A'^ in p.7p . i.e., 

(3.8) A'=+i=A'^--i(A:*=+i-y'^+i), 

p.7p can be rewritten as 

(3.9) (s + A'^+i + -(y'=-r'^+i),x-x'=+i) <o,vx6C. 

Letting X = X''^^ in p.Sp and X = X* in p.9p . and summing the two resulting inequalities, we get, 

(3.10) (A'^+i - A* + -(r*^ - Y''+^),X* - AT'^+i) < 0. 

^^ 

The optimality conditions for the second subproblem (i.e., the subproblem with respect to Y) in (j2.7p 
are given by y'^+i e S and 

(3.11) e (9/B(r'^+i) + A'^ + 1(^*^+1 - x'^+i). 

By using Property 2 and p.8p . p. lip can be rewritten as 

(3.12) (-A''+\Y -Y''+^) <0,yY e B. 

Letting Y = y'^+i in p.6p and F = F* in p.l2p . and summing the two resulting inequalities, we obtain, 

(3.13) (A* - A*^+\y* -y'^+i) < 0. 

Summing (|3l0l) and ([3l3| . and using the facts that X* = Y* and X'^+i = /x(A'= - A^^'+i) + y^^'+i, we 
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obtain, 

(3.14) - h^+^,h^+^ - A*) + i(r'= - r'^+i _ y*) > ~{y'' ~ Y^"-^,k^ - A'^^+i). 

A* 

Rearranging the left hand side of (|XT4l) by using A'^+i - A* = (A'^'+i - A'^) + (A'^ - A*) and Y^"^^ - Y* = 

(yfc+i - y'^) + (yfe - y*), we get 

(3.15) 

/i(A'=-A*,A'=-A'^'+i) + i(y'=-r*,y'^-y'^+i) > fi\\A''-A''+^f + -\\Y''-Y''+^f-(A''+^-A'',Y''+^-Y''}. 
Using the notation of [/'"', U* and G, p.lSp can be rewritten as 

(3.16) {U'' - U*, U'' - U''+^)g >\\U'' - U'^+^Wl - (A'^' - A'-'+i, r'= - Y''+^). 

Combining p.l6p with the identity 

we get 

= 2([/^--[/^-+i,L/^-L/*)-||L/^+i-J7'=||^ 
^ -1 ) > 2||C/'^'-J7'^+i|[2,_2(A'^-A^-+i,r^--y'=+i)-||C/^-+i-C/'^-||2j 

Now, using p.l2p for k instead of fc + 1 and letting Y = we get, 

(3.18) (-A'=,y'=+i -r*^) <o. 

Letting F = F*^ in ((51^ and adding it to ((XTO)) yields, 

(3.19) {A'' - A''+\y'' -Y''+^) <0. 

By substituting p. 191) into p.l7p we get the desired result (|3.ip . □ 

We are now ready to give the main eonvergenee result of p.7p (Algorithm [2]) . 

Theorem 3.2. The sequence {{X'^ ,Y'' , A'')} produced by jSj]) (Algorithm\^ from any starting point 
converges to an optimal solution to Problem ()2.4p . 

Proof. From Lemma 13.11 we can easily get that 

• (i) \\U' - U'+^\\g ^ 0; 

• (ii) {U'^} lies in a compact region; 

• (iii) ||[/'"' — U*\\q is monotonically non-increasing and thus converges. 

It follows from (i) that A*^ - A'^+i ^ and Y'' - Y''+^ 0. Then ([31]) implies that - X^+^ and 
j^fe _ yfc ^ 0. From (ii) we obtain that, U'^ has a subsequence {U'^^} that converges to ?7 = (A,y), i.e., 
A'^^ ^ A and F'^'j ^ Y. From X'^ - F'^ ^ we also get that X''^ X ■.= Y. Therefore, {X, Y, A) is a hmit 
point of {(X'^-,y'^-,A'^-)}. 
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Note that by using p.8p . p.7p can be rewritten as 



(3.20) 



k+l 




which imphes that 



(3.21) 



e -T. + dIc{X)- A. 



Note also that (|3.1ip iniphcs that 



(3.22) 



6 dlB{Y)+A. 



Moreover, it follows from X'' e C and Y'' e B that 



(3.23) 



X G C and y e B. 



([MTjl . ([3?22|) . (|3?23| together with X ^Y imply that (X,r, A) satisfies the KKT conditions for ([23]) and 
thus is an optimal solution to ^2A\i . Therefore, we showed that any limit point of {{X'' ,Y'' , A'^)} is an 
optimal solution to (|2.4[) . □ 

4. The Deflation Techniques and Other Practical Issues. It should be noticed that the solution 
of Problem only gives the largest eigenvector (the eigenvector corresponding to the largest eigenvalue) of 
S. In many applications, the largest eigenvector is not enough to explain the total variance of the data. Thus 
one usually needs to compute several leading eigenvectors to explain more variance of the data. Hotelling's 
deflation method [32] is usually used to extract the leading eigenvectors sequentially. The Hotelling's deflation 
method extracts the r-th leading eigenvector of S by solving 



It is easy to verify that Hotelling's deflation method preserves the positive-semidefiniteness of matrix S^. 
However, as pointed out in |25] , it does not preserve the positive-semidefiniteness of when it comes to the 
sparse PCA problem (|1.2p . because the solution Xr is no longer an eigenvector of S^-i- Thus, the second 
leading eigenvector produced by solving the sparse PCA problem may not explain well the variance of the 
data. We should point out that the deflation method used in [9] is the Hotelling's deflation method. 

Several deflation techniques to overcome this difficulty for sparse PCA were proposed by Mackey in [25] . 
In our numerical experiments, we chose to use the Schur complement deflation method in |25) . The Schur 
complement deflation method updates matrix by 



a;r=argmax{x "Er-ix, s.t. ||a;||2 < 1}, 



where So '■= S and 




(4.1) 



= E, 




The Schur complement deflation method has the following properties as shown in |25| . (i) Schur complement 
deflation preserves the positive-semidefiniteness of Er, i.e., E^ ^ 0. (ii) Schur complement deflation renders 

11 



Xs orthogonal to for s < r, i.e., S^Xg = 0,Vs < r. 

When we want to find the leading r sparse PCs of S, we use ADMM to solve sequentially r problems 
(|1.4p or (jl.Sp with S updated by the Schur complement deflation method (|4.ip . We denote the leading 
r sparse PCs obtained by our ADMM as Xr = {xi, . . . ,Xr)- Usually the total variance explained by Xr 
is given by Tr(X^EX-r). However, because we do not require xi,. . . ,Xr to be orthogonal to each other 
when we sequentially solve the SDPs p.4p or (|1.5p . these loadings are correlated. Thus, Tr{XjY.Xr) will 
overestimate the total explained variance hy xi, . . . ,Xr- To alleviate the overestimated variance, Zou et al. 
[42] suggested that the explained total variance should be computed using the following procedure, which 
was called adjusted variance: 

AdjVar{Xr) := Tr{R^), 

where Xr = QR is the QR decomposition of Xr- In our numerical experiments, we always report the adjusted 
variance as the explained variance. 

It is also worth noticing that the problems we solve are convex relaxations of the original problems (|1.2p 
and (|1.9[) . Hence, one needs to postprocess the matrix X obtained by solving (jl.4p or (|1.5p to get the solution 
to (|1.2p or (|1.9p . To get the solution to the original sparse PCA problem (|1.2p or (|1.9p from the solution X 
of the convex SDP problem, we simply perform a rank-one decomposition to X, i.e., X = xx^ . Since X is 
a sparse matrix, x should be a sparse vector. This postprocessing technique is also used in [9j. 

Since the sequences {X''} and {Y'^} generated by ADMM converge to the same point eventually, we 
terminate ADMM when the difference between X'^ and Y'^ is sufficiently small. In our numerical experiments, 
we terminate ADMM when 

max{l,||X'=||f^,|iy'^||^^} ^ 

5. Numerical Results. In this section, we use our ADMM to solve the SDP formulations (|1.4p and 
(jl.Sp of sparse PCA on both synthetic and real data sets. We compare the performance of ADMM with 
two methods for solving sparse PCA. One method is DSPCA [9] for solving (jl.Sp and the other method is 
ALSPCA EH for solving p^TT]) . The Matlab codes of DSPCA and ALSPCA were downloaded from the 
authors' websites. Note that the main parts of the DSPCA codes were actually written in C-Mex files. Our 
codes were written in Matlab. All experiments were run in MATLAB 7.3.0 on a laptop with 1.66GHZ CPU 
and 2GB of RAM. 

5.1. A synthetic example. We tested our ADMM on a synthetic data set suggested by Zou ct al. in 
[42) . This synthetic example has three hidden factors: 

Vi - 7V(0, 290), V2 - A/'(0, 300), F3 = -0.3^1 + 0.925^2 + e, 

where e ^ M{0, 1), and Vi, V2 and e are independent. The 10 observable variables are given by the following 
procedure: 

X, = Vi+el ei^A/-(0,l), z = 1,2,3,4, 
X, = V2 + el e2~A/-(0,l), ^ = 5,6,7,8, 
X^ = V3 + el e3_A/'(0,l), z = 9, 10, 
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where are independent for j = 1, 2, 3 and i ~ 1, . . . , 10. The exact covariance matrix of (Xi, . . . , Xiq) is 
used to find the standard PCs by standard PCA and sparse PCs by ADMM. Note that the variances of Vi , V2 
and V3 indicate that V2 is shghtly more important than Vi and they are both more important than V3. Also, 
we note that the first two PCs explain more than 99% of the total variance. Thus, using the first two PCs 
should be able to explain most of the variance, and the first sparse PC explains most of the variance of V2 
using (X5, Xg, Xj, Xg,) and the second sparse PC explains most of the variance of Vi using {Xi, X2, X^, X4). 
Based on these observations, we set K = A in (|1.4[) for computing both the first and the second sparse PCs. 
When we computed the second sparse PC, we used the Schur complement deflation method described in 
Section!?] to construct the corresponding sample covariance matrix. The penalty parameter fj, in ADMM was 
set to 0.8. The PCs given by the standard PCA and sparse PCA using our ADMM for solving (|1.5p and the 
explained variances are shown in Table 15.11 From Table 15.11 we see that ADMM gives sparse loadings using 
{X5, Xq, Xi, Xg) in the first PC and {Xi, X2, X3, X4) in the second PC. The first two sparse PCs explain 
80.41% of the total variance. 

Table 5.1 

Loadings and explained variance for the first two PCs 





Standard PCA 


ADMM 


Variables 


PCI PC2 


PCI PC2 


Xi 


-0.1157 -0.4785 


0.5000 


X2 


-0.1157 -0.4785 


0.5000 


X3 


-0.1157 -0.4785 


0.5000 


Xi 


-0.1157 -0.4785 


0.5000 


X5 


0.3953 -0.1449 


0.5000 


Xe 


0.3953 -0.1449 


0.5000 


X7 


0.3953 -0.1449 


0.5000 


Xs 


0.3953 -0.1449 


0.5000 


X9 


0.4008 0.0095 





Xio 


0.4008 0.0095 





Total explained variance 


99.68% 


80.41% 



5.2. Pit props data. The pit props data set has been a standard benchmark for testing sparse PCA 
algorithms since it was introduced by Jeffers in [21]. The pit props data set has 180 observations and 13 
measured variables. Thus the covariance matrix S is a 13 x 13 matrix. We used our ADMM to compute the 
first six sparse PCs sequentially via the Schur complement deflation technique discussed in Section l4] We 
set K = (6, 2, 2, 1, 1, 1) for the six problems (fL4)) as suggested in |9]. We set fj. 0.8 in ADMM. The first six 
sparse PCs obtained by ADMM are shown in Table [521 We compared the results with ALSPCA for solving 
(|l.lip . For ALSPCA, we used the parameters as suggested by the authors, i.e., r = 6,p = 0.70, A^^- = 
0.50, Vi 7^ j. The results given by ALSPCA are reported in Table Since there was no clue how to choose 
the six parameters p in the six problems (|1.5p when DSPCA is used to solve them, we did not compare with 
DSPCA for solving pTS)) . From Tables |0 and [53] we see that both ADMM and ALSPCA gave a solution 
with 15 nonzcros in the first six sparse PCs, and the solution given by ADMM explains slightly more variance 
than the solution given by ALSPCA. 

5.3. Random examples. We created some random examples to test the speed of ADMM and com- 
pared it with DSPCA [9] and ALSPCA [24]. The sample covariance matrix S was created by adding 
some small noise to a sparse rank-one matrix. Specifically, we first created a sparse vector i; € with 
s nonzeros randomly chosen from the Gaussian distribution Af{0, 1). We then got the sample covariance 
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Table 5.2 

First six sparse PCs of the pit props data set given by ADMM 



Variables 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


Topdiam 


-0.4908 

















Length 


-0.5067 

















Moist 





-0.7175 














Testsg 





-0.6965 














Ovensg 








0.9263 











Ringtop 


-0.0668 





0.3511 











Ringbut 


-0.3565 





0.1369 











Bowmax 


-0.2334 

















Bowdist 


-0.3861 

















Whorls 


-0.4089 

















Clear 











1.0000 








Knots 














1.0000 





Diaknot 

















1.0000 


Total sparsity 


15, total explained variance: 74.31% 





Table 5.3 

First six sparse PCs of the pit props data set given by ALSPCA 



Variables 


PCI 


PC2 


PC3 


PC4 


PC5 


PC6 


Topdiam 


0.4052 

















Length 


0.4248 

















Moist 





-0.7262 














Testsg 


0.0014 


-0.6874 














Ovensg 








-1.0000 











Ringtop 


0.1857 

















Ringbut 


0.4122 

















Bowmax 


0.3277 

















Bowdist 


0.3829 

















Whorls 


0.4437 


0.0021 














Clear 











1.0000 








Knots 














1.0000 





Diaknot 

















1.0000 


Total sparsity 


: 15, tota 


explained variance: 73.29% 





matrix S — xx + <jvv , where a denotes the noise level and ?j G is a random vector with entries 
uniformly drawn from [0, 1]. We applied DSPCA, ALSPCA and ADMM to find the largest sparse PC of 
S. We report the comparison results in Tables 15.41 and 15.51 that correspond to noise levels a = 0.01 and 
a- = 0.1 respectively. When using DSPCA to solve ((T3)) and ALSPCA to solve (|l.lip . we set different p's 
to get solutions with different sparsity levels. Specifically, we tested DSPCA for p = 0.01,0.1,1 in Table 
EH with cr = 0.01 and p = 0.1,1,10 in Table [53] with a = 0.1; we tested ALSPCA for p = 0.01,0.1,1 in 
both Tables and [575] We set different K's in (|1.4p to control the sparsity level when using ADMM to 
solve it. In both Tables 15.41 and 15.51 we tested four data sets with dimension p and sparsity s setting as 
(p,s) = (100, 10), (100,20), (200, 10) and (200,20). We used the following continuation technique for p. in 
ADMM: ^0 = l,A*fe = max{2^fe_i/3, 10"^}. Aij were set to 0.1 for all i ^ j in all the tests for ALSPCA as 
suggested in |24] for tests on random data sets. 

We report the cardinality of the largest sparse PC (Card), the percentage of the explained variance 
(PEV) and the CPU time in Tables [El and EE From Table [El we see that, for a = 0.01, all three 
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algorithms DSPCA, ADMM and ALSPCA are sensitive to the parameters that control the sparsity, i.e., p 
and K . p ~ 0.01 always gave the best results for DSPCA and ALSPCA and the explained variance is very 
close to the standard PCA. p = 0.1 still provided relatively good solutions for DSPCA and ALSPCA in terms 
of both sparsity and the explained variance. When p was increased to 1, the solutions given by ALSPCA 
were too sparse to give a relatively large explained variance; while the solutions given by DSPCA sometimes 
had more nonzcros than the desired sparsity level (when (p, s) — (100,10)), and even when the solutions 
were of the desired sparsity level, the explained variances were affected a lot (when (p, s) = (100,20) and 
(200, 20)). For ADMM, K = 5, 4, 3 were tested for s = 10 and -ftT = 10, 9, 8 were tested for s = 20. Results 
shown in Table indicate that K = s/2 usually produced good results. When K was changed from 5 
to 4 and 3 for s — 10, the sparsity and explained variance of the solution changed a lot. When K was 
changed from 10 to 9 and 8 for s = 20, the solution was not affected too much in terms of the explained 
variance. Especially, for (p, s) — (100, 20) and (200, 20) and K = 8, ADMM gave solutions with sparsity 
13 that explain 96.10% and 93.66% variance respectively, which arc both very close to the results given by 
the standard PCA. From Table [53] we see that, for a = 0.1, i.e., when the noise level was large, DSPCA 
and ALSPCA were more sensitive to the noise compared with their performance when a = 0.01. More 
specifically, in Table [5751 p = 0.1 usually gave a solution with the best explained variance and appropriate 
sparsity for DSPCA, expect for {p, s) = (100, 10), where the solution produced by DSPCA had 21 nonzeros, 
which was much more than the desired sparsity 10. The solutions given by DSPCA for p = 1 and p = 10 
were not very satisfied. For ALSPCA, when (p, s) = (100, 10) and (100, 20), p = 0.1 gave good results, while 
the results given hy p — 0.01 and p — 1 were not very satisfied. However, we observed that the performance 
of ADMM when a = 0.1 was consistent with its performance when a = 0.01, i.e., its performance was not 
very sensitive to the noise. 

From both Tables 15.41 and 15.51 we see that ALSPCA was slightly faster than ADMM, and they were 
both significantly faster than DSPCA. This is reasonable because ALSPCA solves the non-convex problem 
(|l.lip and thus eigenvalue decomposition is not required, which costs most of the computational effort in 
DSPCA and ADMM. 

5.4. Text data classification. Sparse PCA can also be used to classify the keywords in text data. 
This application has been studied by Zhang, d'Aspremont and El Ghaoui in [40] and Zhang and El Ghaoui 
in [41]. In this section, we show that by using our ADMM to solve the sparse PCA problem, we can also 
classify the keywords from text data very well. The data set we used is a small version of the "20- newsgroups" 
datsQ, which is also used in [40]. This data set consists of the binary occurrences of 100 specific words across 
16242 postings, i.e., the data matrix M is of the size 100 x 16242 and Mij = 1 if the i-th word appears 
at least once in the j-th posting and = if the i-th word does not appear in the j-th posting. These 
words can be approximately divided into different groups such as "computer" , "religion" etc. We want to 
find the words that contribute as much variance as possible and also discover which words are in the same 
category. By viewing each posting as a sample of the 100 variables, we have 16424 samples of the variables, 
and thus the sample covariance matrix S G Riooxioo. Using standard PCA, it is hard to interpret which 
words contribute to each of the leading eigenvalues since the loadings are dense. However, sparse PCA can 
explain as much the variance explained by the standard PCs, and meanwhile interpret well which words 
contribute together to the corresponding variance. We applied our ADMM to solve (|1.4|) to find the first 
three sparse PCs. We set K — 5 in all three problems and the following continuation technique was used 
for p: pq ~ 100, pk ~ max{2/^fc_i/3, 10^"*}. The resulting three sparse PCs have 10, 12 and 17 nonzeros 

^This data set can be downloaded f rom |http : / /cs . nyu . edu/ ~roweis /data, htnil] 

15 



Table 5.4 

Comparisons of ADMM, ALSPCA and DSPCA on random examples with a = 0.01 





Method 


Parameters 


Card 


PEV 


CPU 


(100,10) 


PCA 






96.16% 






DSPCA 


p = 0.01 


10 


96.16% 


12.12 






p = 0.1 


10 


95.81% 


8.29 






P = 1 


13 


87.28% 


6.56 




ADMM 


K = 5 


9 


95.30% 


0.26 






K = 4 


6 


91.55% 


0.28 






K = 3 


4 


79.30% 


0.19 




ALSPCA 


p = 0.01 


10 


96.16% 


0.13 






p = 0.1 


9 


96.02% 


0.39 






fl = 1 


4 


89.54% 


0.13 


(100,20) 


PCA 






98.07% 






DSPCA 


p = 0.01 


20 


98.07% 


10.93 






p = 0.1 


20 


97.71% 


8.47 






p=l 


20 


85.25% 


5.75 




ADMM 


if = 10 


20 


97.98% 


0.28 






if = 9 


18 


97.40% 


0.28 






if = 8 


13 


96.10% 


0.31 




ALSPCA 


p = 0.01 


20 


98.07% 


0.13 






p = 0.1 


18 


97.83% 


0.13 






f = 1 


8 


83.87% 


0.13 


(200,10) 


PCA 






91.43% 






DSPCA 


p = 0.01 


10 


91.42% 


88.87 






p = 0.1 


10 


91.09% 


61.36 






p = 1 


8 


82.91% 


45.97 




ADMM 


if 5 


9 


90.61% 


1.23 






if = 4 


6 


87.04% 


1.27 






if = 3 


4 


75.40% 


0.88 




ALSPCA 


p = 0.01 


10 


91.42% 


0.23 






p = 0.1 


9 


91.29% 


0.23 






fl = 1 


4 


85.13% 


0.29 


(200,20) 


PCA 






95.58% 






DSPCA 


p = 0.01 


20 


95.58% 


79.87 






p = 0.1 


20 


95.22% 


63.12 






p = l 


20 


83.09% 


42.22 




ADMM 


K =10 


20 


95.49% 


1.57 






K = 9 


18 


94.93% 


1.67 






K = 8 


13 


93.66% 


1.71 




ALSPCA 


p== 0.01 


20 


95.58% 


0.23 






p = 0.1 


18 


95.34% 


0.23 






P=l 


8 


81.73% 


0.23 



respectively. The total explained variance by these three sparse PCs is 12.72%, while the variance explained 
by the largest three PCs by the standard PCA is 19.10%. 

The words corresponding to the first three sparse PCs generated by our ADMM are listed in Table [^751 
From Table [^751 we see that the words in the first sparse PC are approximately in the category "school", the 
words in the second PC are approximately in the category "religion" , and the words in the third sparse PC 
are approximately in the category "computer" . So our ADMM can classify the keywords into appropriate 
categories very well. 



Table 5.5 

Comparisons of ADMM, ALSPCA and DSPCA on random examples with cr = 0.1 



(v. s) 

yft '-'J 


Method 


Parameters 


Card 


PEV 


CPU 


(100,10) 


PCA 






71.51% 






DSPCA 


p = 0.1 


21 


71.23% 


9.42 






P = 1 

t 


10 


64.92% 


4.40 






p = 10 


1 


27.04% 


4.14 




ADMM 


K = 5 


9 


70.83% 


0.24 






K = 4 


6 


68.03% 


0.25 






K = 3 


4 


58.93% 


0.17 




ALSPCA 


p = 0.01 


31 


71.49% 


0.13 






p = 0.1 


9 


71.39% 


0.13 






fl = 1 


4 


66.53% 


0.13 


(100,20) 


PCA 






83.59% 






DSPCA 


p = 0.1 


20 


83.27% 


8.58 






p = 1 


20 


72.75% 


5.82 






p = 10 


56 


26.80% 


3.01 




ADMM 


K =10 


20 


83.50% 


0.32 






K = 9 


18 


83.02% 


0.40 






K = 8 


13 


81.90% 


0.27 




ALSPCA 


p = 0.01 


25 


83.58% 


0.13 






p = 0.1 


19 


83.37% 


0.13 








8 


71.37% 


0.13 


(200,10) 


PCA 






51.69% 






DSPCA 


p = 0.1 


10 


51.46% 


19.19 






P = 1 

r 


88 


46.95% 


28.51 






p = 10 


1 


19.80% 


28.04 




ADMM 


K = b 


9 


51.15% 


1.22 






K = 4 


6 


49.13% 


1.38 






K = 3 


4 


42.61% 


0.87 




ALSPCA 


p = 0.01 


10 


51.61% 


0.26 






p = 0.1 


9 


51.53% 


0.25 






p = I 


4 


48.01% 


0.24 


(200,20) 


PCA 






68.38% 






DSPCA 


p = 0.1 


20 


68.12% 


74.87 






p=l 


20 


59.54% 


42.63 






p= 10 


64 


22.03% 


34.83 




ADMM 


K = 9 


18 


67.91% 


1.46 






K = 8 


14 


67.01% 


1.74 






K = 7 


11 


65.26% 


1.76 




ALSPCA 


p== 0.01 


20 


68.37% 


0.24 






p = 0.1 


18 


68.20% 


0.25 






P=l 


8 


58.37% 


0.24 



5.5. Senate voting data. In this section, we use sparse PCA to analyze the voting records of the 109th 
US Senate, which was also studied by Zhang, d'Aspremont and El Ghaoui in [JO]. The votes are recorded 
as 1 for "yes" and —1 for "no". Missing votes are recorded as 0. There are 100 senators (55 Republican, 
44 Democratic and 1 independent) and 542 bills involved in the data set. However, there are many missing 
votes in the data set. To obtain a meaningful data matrix, we only choose the bills for which the number of 
missing votes is at most one. There are only 66 such bills among the 542 bills. So our data matrix M is a 
66 X 100 matrix with entries 1,-1 and 0, and each column of M corresponds to one senator's voting. The 



Table 5.6 

Words associated with the first three sparse PCs using ADMM 



1st PC (10 words) 


2nd PC (12 words) 


3rd PC (17 words) 


case 


bible 


computer 


course 


case 


email 


email 


cliristian 


files 


fact 


course 


ftp 


help 


evidence 


graphics 


number 


fact 


number 


problem 


god 


phone 


question 


government 


problem 


system 


human 


program 


university 


jesus 


research 




religion 


science 




world 


software 






space 






state 






university 






version 






windows 


Total sparsity: 


39, total explained variance: 12.72% 



sample covariance matrix E ~ MM in our test is a 66 x 66 matrix. 

To see how standard PCA and sparse PCA perform in classifying the voting records, we implemented 
the following procedure as suggested in [40]. We used standard PCA to find the largest two PCs (denoted 
as vi and V2) of E. We then projected each column of M onto the subspace spanned by vi and V2, i.e., we 
found oii and I3i for each column Mi such that 

(Q;i,/3i) := arg min \\aiVi + I3.,V2 - M^W. 

We then drew each column ALi as a point (a^, Pi) in the two-dimensional subspace spanned by vi and V2- The 
left figure in Figure [5Tl shows the 100 points. We see from this figure that senators are separated very well by 
partisanship. However, it is hard to interpret which bills are responsible to the explained variance, because 
all the bills are involved in the PCs. By using sparse PCA, we can interpret the explained variance by just a 
few bills. We applied our ADMM to find the first two sparse PCs (denoted as si and S2) of E. We set K = A 
for both problems and used the following continuation technique on /i: /iq = 100, /i^ = max{2/j,fe_i/3, 10^^}. 

The resulting two sparse PCs si and S2 produced by our ADMM have 9 and 5 nonzeros respectively. We 
projected each column of M onto the subspace spanned by these two sparse PCs. The right figure in Figure 
15.11 shows the 100 projections onto the subspace spanned by the sparse PCs si and S2- We see from this 
figure that the senators are still separated well by partisanship. Now since only a few bills are involved in the 
two sparse PCs, we can interpret which bills are responsible most for the classification. The bills involved in 
the first two PCs are listed in Table 15.71 From Table 15.71 we see that the most controversial issues between 
Republicans and Democrats are topics such as "Budget" and "Appropriations" . Other controversial issues 
involve topics like "Energy" , "Abortion" and "Health" . 

6. Conclusion. In this paper, we proposed alternating direction method of multipliers to solve an 
SDP relaxation of the sparse PCA problem. Our method incorporated a variable-splitting technique to 
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Table 5.7 

Bills involved in the first two PCs by ADMM 



Bills in the first sparse PC 
Budget, Spending and Taxes_Education Funding Amcndment_3804 
Budget, Spending and Taxes_Rcinstate Pay-As-You-Go through 2011 Amendment_3806 
Energy IssuesXIHEAP Funding Aniendnient_3808 
Abortion Issues_Unintended Pregnancy Aniendnient_3489 
Budget, Spending and TaxesJ3udget FY2006 Appropriations Resolution_3488 
Budget, Spending and Taxes_Budget Reconciliation bilL3665 
Budget, Spending and Taxes_Budget Reconciliation bilL3789 
Budget, Spending and Taxes_Education Amendinent_3490 
Health Issues_Mcdicaid Amendment_3496 

Bills in the second sparse PC 
Appropriations_Agriculturc, Rural Development, FDA Appropriations Act_3677 
Appropriations-Emergency Supplemental Appropriations Act, 2005_3515 
Appropriations-Emergency Supplemental Appropriations Act, 2006.3845 
Appropriationsjnterior Department FY 2006 Appropriations Bill_3595 
Executive BrancliJohn Negroponte, Director of National Intelligence_3505 
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Fig. 5.1. Projection of the senate voting records onto the subspace spanned by the top 2 principal components: Left: 
standard PCA; Right: sparse PCA 

separate the £i norm constraint, which controls the sparsity of the solution, and the positive-semidefiniteness 
constraint. This method resulted in two relatively simple subproblems that have closed-form solutions in 
each iteration. Global convergence results were established for the proposed method. Numerical results on 
both synthetic data and real data from classification of text data and senate voting records demonstrated 
the efficacy of our method. 

Compared with Nesterov's first-order method DSPCA for sparse PGA studied in [5], our ADMM method 
solves the primal problems directly and guarantees sparse solutions. Numerical results also indicate that 
ADMM is much faster than DSPCA. Compared with methods for solving nonconvex formulations of sparse 
PCA, the nonsmooth SDP formulation considered in this paper usually requires more computational effort in 
each iteration. However, the global convergence of our ADMM for solving the nonsmooth SDP is guaranteed, 
while methods for solving nonconvex problems usually have only local convergence. 
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